Optimizing parameters for machine translation

ABSTRACT

Methods, systems, and apparatus, including computer program products, for language translation are disclosed. In one implementation, a method is provided. The method includes determining, for a plurality of feature functions in a translation lattice, a corresponding plurality of error surfaces for each of one or more candidate translations represented in the translation lattice; adjusting weights for the feature functions by traversing a combination of the plurality of error surfaces for phrases in a training set; selecting weighting values that minimize error counts for the traversed combination; and applying the selected weighting values to convert a sample of text from a first language to a second language.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 61/078,262, entitled “Statistical Machine Translation”, which was filed on Jul. 3, 2008. The disclosure of the above application is incorporated herein by reference in its entirety.

BACKGROUND

This specification relates to statistical machine translation.

Manual translation of text by a human operator can be time consuming and costly. One goal of machine translation is to automatically translate text in a source language to corresponding text in a target language. There are several different approaches to machine translation, including example-based machine translation and statistical machine translation. Statistical machine translation attempts to identify a most probable translation in a target language given a particular input in a source language. For example, when translating a sentence from French to English, statistical machine translation identifies the most probable English sentence given the French sentence. This maximum likelihood translation can be expressed as:

${\underset{e}{\arg \; \max}{P( e \middle| f )}},$

which describes the English sentence, e, out of all possible sentences, that provides the highest value for P(e|f). Additionally, Bayes' Rule provides that:

${P( e \middle| f )} = {\frac{{P(e)}{P( f \middle| e )}}{P(f)}.}$

Using Bayes Rule, this most likely sentence can be re-written as:

${\underset{e}{\arg \; \max}{P( e \middle| f )}} = {\underset{e}{\arg \; \max}{P(e)}{{P( f \middle| e )}.}}$

Consequently, the most likely e (i.e., the most likely English translation) is the one that maximizes the product of the probability that e occurs and the probability that e would be translated into f (i.e., the probability that a given English sentence would be translated into the French sentence).

Components that perform the translation portions of a language translation task are frequently referred to as decoders. In certain instances, a first decoder (a first-pass decoder) can generate a list of possible translations, e.g., an N-best list. A second decoder (a second-pass decoder), e.g., a Minimum Bayes-Risk (MBR) decoder, can then be applied to the list to ideally identify which of the possible translations are the most accurate, as measured by minimizing a loss function that is part of the identification. Typically, an N-best list contains between 100 and 10,000 candidate translations, or hypotheses. Increasing the number of candidate translations improves the translation performance of an MBR decoder.

SUMMARY

This specification describes technologies relating to language translation.

In general, one aspect of the subject matter described in this specification can be embodied in methods that include the actions of determining, for a plurality of feature functions in a translation lattice, a corresponding plurality of error surfaces for each of one or more candidate translations represented in the translation lattice; adjusting weights for the feature functions by traversing a combination of the plurality of error surfaces for phrases in a training set; selecting weighting values that minimize error counts for the traversed combination; and applying the selected weighting values to convert a sample of text from a first language to a second language. Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.

These and other embodiments can optionally include one or more of the following features. The translation lattice includes a phrase lattice. Arcs in the phrase lattice represent phrase hypotheses and nodes in the phrase lattice represent states at which partial translation hypotheses were recombined. The error surfaces are determined and traversed using a line optimization technique. The line optimization technique determines and traverses, for each feature function and sentence in a group, an error surface on a set of candidate translations. The line optimization technique determines and traverses the error surface starting from a random point in a parameter space. The line optimization technique determines and traverses the error surface using random directions to adjust the weights.

The weights are limited by restrictions. The weights are adjusted using weights priors. The weights are adjusted over all sentences in a group of sentences. The method further includes selecting a target translation, from a plurality of candidate translations, that maximizes the a-posteriori probability for the translation lattice. The translation lattice represents more than one billion candidate translations. The phrases include sentences. The phrases all include sentences.

In general, another aspect of the subject matter described in this specification can be embodied in systems that include a language model that includes: a collection of feature functions in a translation lattice; a plurality of error surfaces for a set of candidate language translations, across the feature functions; and weighting values for the feature functions selected to minimize error for traversal of the error surfaces. Other embodiments of this aspect include corresponding methods, apparatus, and computer program products.

Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages. MBR decoding of a lattice increases the sizes of the hypothesis and evidence spaces, thereby increasing the number of candidate translations available and the likelihood of obtaining an accurate translation. In addition, MBR decoding provides a better approximation of a corpus BLEU score (as described in further detail below), thereby further improving translation performance. Furthermore, MBR decoding of a lattice is runtime efficient, thereby increasing the flexibility of statistical machine translation since the decoding can be performed at runtime.

Lattice-based Minimum Error Rate Training (MERT) provides exact error surfaces for all translations in a translation lattice, thereby further improving the translation performance of a statistical machine translation system. The systems and techniques for lattice-based MERT are also space and runtime efficient, thereby reducing the amount of memory used, e.g., limiting memory requirements to be at most linearly related to the size of the lattice, and increasing the speed of translation performance.

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a conceptual diagram of an example process for translating input text from a source language to a target language.

FIG. 2A illustrates an example translation lattice.

FIG. 2B illustrates an example MBR automaton for the translation lattice of FIG. 2A.

FIG. 3 illustrates a portion of an example translation lattice.

FIG. 4 shows an example process for MBR decoding.

FIG. 5A shows an example process for Minimum Error Rate Training (MERT) on a lattice.

FIG. 5B illustrates an example Minimum Error Rate Trainer.

FIG. 6 shows an example of a generic computer device and a generic mobile computer device.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

Statistical Translation Overview

Machine translation seeks to take input text in one language and accurately convert it into text in another language. Generally, the accuracy of a translation is measured against the ways in which expert humans would translate the input. An automatic translation system can analyze prior translations performed by human experts to form a statistical model of translation from one language to another. No such model can be complete, however, because the meaning of words often depends on context. Consequently, a step-wise word-for-word transformation of words from one language to another may not provide acceptable results. For example, idioms such as “babe in the woods” or slang phrases do not translate well in a literal word-for-word transformation.

Adequate language models can help provide such context for an automatic translation process. The models can, for example, provide indications regarding the frequency with which two words appear next to each other in normal usage, e.g., in training data, or with which other groups of multiple words or elements (n-grams) appear in a language. An n-gram is a sequence of n consecutive tokens, e.g., words or characters. An n-gram has an order or size, which is the number of tokens in the n-gram. For example, a 1-gram (or unigram) includes one token; a 2-gram (or bi-gram) includes two tokens.

A given n-gram can be described according to different portions of the n-gram. An n-gram can be described as a context and a future token, (context, w), where the context has a length n−1 and w represents the future token. For example, the 3-gram “c₁c₂c₃” can be described in terms of an n-gram context and a future token, where c₁, c₂, and c₃ each represent a character. The n-gram left context includes all tokens of the n-gram preceding the last token of the n-gram. In the given example, “c₁c₂” is the context. The leftmost token in the context is referred to as the left token. The future token is the last token of the n-gram, which in the example is “c₃”. The n-gram can also be described with respect to a right context. The right context includes all tokens of the n-gram following the first token of the n-gram, represented as an (n−1)-gram. In the example above, “c₂c₃” is the right context.

Each n-gram can have an associated probability estimate, e.g., a log-probability, that is calculated as a function of its count of occurrences in training data relative to a count of total occurrences in the training data. In some implementations, the probability of an n-gram being a translation of input text is trained using the relative frequency of the n-gram represented in a target language as a reference translation of corresponding text in a source language in training data, e.g., training data including a set of text in the source language and corresponding text in the target language.
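For illustration, the following is a minimal sketch of such a relative-frequency estimate. The data format (an iterable of tokenized target-side sentences) and the function names are assumptions for the example, not part of the described system, and no smoothing is applied:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Yield all n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequency_log_probs(target_sentences, n=3):
    """Estimate log-probabilities of n-grams from relative frequency.

    target_sentences: iterable of token lists taken from the target side of a
    parallel training corpus (hypothetical data format for this sketch).
    """
    ngram_counts = Counter()
    context_counts = Counter()
    for sent in target_sentences:
        for gram in ngrams(sent, n):
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1   # count of the (n-1)-gram context
    # log P(future token | context) = log(count(n-gram) / count(context))
    return {
        gram: math.log(count / context_counts[gram[:-1]])
        for gram, count in ngram_counts.items()
    }
```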

Additionally, in some implementations, a distributed training environment is used for large training data (e.g., terabytes of data). One example technique for distributed training is MapReduce. Details of MapReduce are described in J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proceedings of the 6th Symposium on Operating Systems Design and Implementation, pp. 137-150 (Dec. 6, 2004).

Past usage represented by a training set can be used to predict how samples in one language should be translated to a target language. In particular, the n-grams, associated probability estimates, and respective counts can be stored in a language model for use by a decoder, e.g., a Bayesian decoder, to identify translations for input text. A score indicating the likelihood that input text can be translated to corresponding text in a target language can be calculated by mapping the n-grams included in the input text to associated probability estimates for a particular translation.

Example Translation Process

FIG. 1 is a conceptual diagram of an example process 100 for translating input text from a source language to a target language. A source sample 102 is shown as a passage of Chinese text, and is provided to a first decoder 104. The decoder 104 can take a variety of forms and can be used in an attempt to maximize a posterior probability for the passage, given a training set of documents 106 that has been provided to the decoder 104 during a training phase for the decoder 104. In translating the sample 102, the decoder 104 can select n-grams from within the document and attempt to translate the n-grams. The decoder 104 can be provided with a re-ordering model, alignment model, and language model, among other possible models. The models direct the decoder 104 in selecting n-grams from within the sample 102 for translation. As one simple example, a model can use delimiters, e.g., punctuation such as a comma or period, to identify the end of an n-gram that may represent a word.

The decoder 104 can produce a variety of outputs, e.g., data structures that include possible translations. For example, the decoder 104 can produce an N-best list of translations. In some implementations, the decoder 104 generates a translation lattice 108, as described in further detail below.

A second decoder 110 then processes the translation lattice 108. While the first decoder 104 is generally aimed at maximizing the posterior probability of the translation, i.e., matching the input to what the historical collection of documents 106 may indicate to be a best match to past expert manual translations of other passages, the second decoder 110 is aimed at maximizing a quality measure for the translation. As such, the second decoder 110 may re-rank the candidate translations that reside in the translation lattice so as to produce a “best” translation that may be displayed to a user of the system 100. This translation is represented by the English sample 112 corresponding to the translation of the Chinese sample 102.

The second decoder 110 can use a process known as MBR decoding, which seeks the hypothesis (or candidate translation) that minimizes the expected error in classification. The process thus directly incorporates a loss function into the decision criterion for making a translation selection.

Minimum Bayes Risk Decoding

Minimum Bayes-Risk (MBR) decoding aims to find a translation hypothesis, e.g., a candidate translation, that has the least expected error under the probability model. Statistical machine translation can be described as a mapping of input text F in a source language to translated text E in a target language. A decoder δ(F), e.g., decoder 104, can perform the mapping. If the reference translation E is known, the decoder performance can be measured by the loss function L(E, δ(F)). Given such a loss function L(E, E′) between an automatic translation E′ and the reference translation E, and an underlying probability model P(E, F), the MBR decoder, e.g., the second decoder 110, can be represented by:

$\hat{E} = \underset{E^{\prime} \in \Psi}{\arg\min}\, R(E^{\prime}) = \underset{E^{\prime} \in \Psi}{\arg\min} \sum_{E \in \Psi} L(E, E^{\prime})\, P(E \mid F),$

where R(E′) represents the Bayes risk of candidate translation E′ under the loss function L, and Ψ represents the space of translations. For N-best MBR, the space Ψ is an N-best list produced, for example, by the first decoder 104. When a translation lattice is used, Ψ represents the candidate translations encoded in the translation lattice.
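For concreteness, the following is a small sketch of N-best MBR decoding under these definitions. The `loss` function (e.g., one minus a sentence-level BLEU score) and the dictionary of posteriors P(E|F) are assumed inputs for the example; this is distinct from the lattice-based decoder described later:

```python
def nbest_mbr_decode(candidates, posteriors, loss):
    """Pick the hypothesis E' minimizing the Bayes risk sum_E L(E, E') P(E|F).

    candidates: list of candidate translations (the N-best list serves as both
                hypothesis and evidence space here).
    posteriors: dict mapping each candidate E to its posterior P(E|F).
    loss:       function loss(E, E_prime) -> float.
    """
    best, best_risk = None, float("inf")
    for e_prime in candidates:                       # hypothesis space
        risk = sum(loss(e, e_prime) * posteriors[e]  # expected loss of e_prime
                   for e in candidates)              # evidence space
        if risk < best_risk:
            best, best_risk = e_prime, risk
    return best
```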

If the loss function between any two hypotheses can be bounded, i.e., L(E, E′) ≤ L_max, the MBR decoder can be written in terms of a gain function, G(E, E′) = L_max − L(E, E′), as:

$\hat{E} = \underset{E^{\prime} \in \Psi}{\arg\max} \sum_{E \in \Psi} G(E, E^{\prime})\, P(E \mid F). \quad (\text{Eq. } 1)$

In some implementations, MBR decoding uses different spaces for hypothesis selection and risk computation. For example, the hypothesis can be selected from an N-best list and the risk can be computed based on a translation lattice. In this example, the MBR decoder can be rewritten as:

$\hat{E} = \underset{E^{\prime} \in \Psi_{h}}{\arg\max} \sum_{E \in \Psi_{e}} G(E, E^{\prime})\, P(E \mid F),$

where Ψ_h represents the hypothesis space and Ψ_e represents an evidence space used for computing the Bayes risk.

MBR decoding can be improved by using larger spaces, i.e., hypothesis and risk computation spaces. Lattices can include more candidate translations than an N-best list. For example, lattices can include more than one billion candidate translations. As such, representing the hypothesis and risk computation spaces using lattices increases the accuracy of MBR decoding, thereby increasing the likelihood that an accurate translation is provided.

Example Translation Lattice and MBR Decoding

FIG. 2A illustrates an example translation lattice 200. In particular, translation lattice 200 is a translation n-gram lattice that can be considered a compact representation of very large N-best lists of translation hypotheses and their likelihoods. Specifically, the lattice is an acyclic weighted finite state acceptor including states (e.g., states 0 through 6) and arcs representing transitions between states. Each arc is associated with an n-gram (e.g., a word or phrase) and a weight. For example, in translation lattice 200, n-grams are represented by labels “a”, “b”, “c”, “d”, and “e”. State 0 is connected to a first arc that provides a path to state 1, a second arc that provides a path to state 4 from state 1, and a third arc that provides a path to state 5 from state 4. The first arc is associated with “a” and weight 0.5, the second arc is associated with “b” and weight 0.6, and the third arc is associated with “d” and weight 0.3.

Each path in the translation lattice 200, comprising consecutive transitions beginning at an initial state (e.g., state 0) and ending at a final state (e.g., state 6), expresses a candidate translation. Aggregating the weights along a path produces the weight H(E, F) of the path's candidate translation according to the model. The weight of the path's candidate translation represents the posterior probability of the translation E given the source sentence F as:

${{P( E \middle| F )} = \frac{\exp ( {\alpha \cdot {H( {E,F} )}} )}{\sum\limits_{E^{\prime} \in \Psi}{\exp ( {\alpha \cdot {H( {E^{\prime},F} )}} )}}},$

where α ∈ (0, ∞) is a scaling factor that flattens the distribution when α < 1 and sharpens the distribution when α > 1.
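A small sketch of this normalization, assuming each candidate path's aggregated model score H(E, F) is already available as a dictionary (an assumed input format for the example); the scaling factor α is applied before the softmax-style normalization:

```python
import math

def path_posteriors(path_scores, alpha=1.0):
    """Convert model scores H(E, F) into posteriors P(E|F) with scaling alpha.

    path_scores: dict mapping a candidate translation E to its score H(E, F).
    alpha < 1 flattens the distribution; alpha > 1 sharpens it.
    """
    # Subtract the maximum score for numerical stability before exponentiating.
    m = max(path_scores.values())
    unnorm = {e: math.exp(alpha * (h - m)) for e, h in path_scores.items()}
    z = sum(unnorm.values())
    return {e: u / z for e, u in unnorm.items()}
```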

In some implementations, a gain function G is expressed as a sum of local gain functions g_w. A gain function can be considered a local gain function if it can be applied to all paths in the lattice using Weighted Finite State Transducer (WFST) composition, resulting in at most an O(N) increase in the number of states N in the lattice. The local gain functions can weight n-grams. For example, given a set of n-grams N = {w₁, . . . , w_|N|}, a local gain function g_w: E × E → ℝ, where w ∈ N, can be expressed as:

g_w(E|E′) = θ_w · #_w(E′) · δ_w(E),

where θ_w is a constant, #_w(E′) is the number of times that w occurs in E′, and δ_w(E) is 1 if w ∈ E and 0 otherwise. Assuming that the overall gain function G(E, E′) can be written as a sum of local gain functions plus a constant θ₀ times the length of the hypothesis E′, the overall gain function can be expressed as:

${G( {E,E^{\prime}} )} = {{{\theta_{0}{E^{\prime}}} + {\sum\limits_{w \in N}{g_{w}( E \middle| E^{\prime} )}}} = {{\theta_{0}{E^{\prime}}} + {\sum\limits_{w \in N}{{\theta_{w} \cdot \#_{w}}{( E^{\prime} ) \cdot {{\delta_{w}(E)}.}}}}}}$

Using this overall gain function, the risk, i.e.,

$\sum_{E \in \Psi} G(E, E^{\prime})\, P(E \mid F),$

can be rewritten such that the MBR decoder for the lattice (in Equation 1) is expressed as:

$\hat{E} = \underset{E^{\prime} \in \Psi}{\arg\max}\Big\{ \theta_{0} |E^{\prime}| + \sum_{w \in N} \theta_{w} \cdot \#_{w}(E^{\prime}) \cdot P(w \mid \Psi) \Big\}, \quad (\text{Eq. } 2)$

where P(w|Ψ) is the posterior probability of the n-gram w in the lattice, i.e., Σ_{E∈Ψ_w} P(E|F), and can be expressed as:

$\begin{matrix}{{{P( w \middle| \Psi )} = {{\sum\limits_{E \in \Psi_{w}}{P( E \middle| F )}} = \frac{Z( \Psi_{w} )}{Z(\Psi)}}},} & ( {{Eq}.\mspace{14mu} 3} )\end{matrix}$

where Ψ_w = {E ∈ Ψ | δ_w(E) > 0} represents the paths of the lattice containing the n-gram w at least once, and Z(Ψ_w) and Z(Ψ) represent the sums of the weights of all paths in the lattices Ψ_w and Ψ, respectively.

In some implementations, the MBR decoder (Equation 2) is implemented using WFSTs. A set of n-grams that are included in the lattice is extracted, e.g., by traversing the arcs in the lattice in topological order. Each state in the lattice has a corresponding set of n-gram prefixes. Each arc leaving a state extends each of the state's prefixes by a single word. N-grams that occur at a state followed by an arc in the lattice are included in the set. As an initialization step, an empty prefix can be added to each state's set.
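The following is a sketch of this extraction step, assuming a hypothetical adjacency-list encoding of the lattice with states already listed in topological order; the data representation and field names are illustrative only:

```python
def extract_ngrams(states, arcs, max_order=4):
    """Collect all n-grams (up to max_order) that occur on lattice paths.

    states: list of state ids in topological order.
    arcs:   dict mapping a state id to a list of (word, next_state) pairs
            (hypothetical adjacency-list encoding of the lattice).
    """
    prefixes = {s: {()} for s in states}   # initialization: empty prefix at each state
    ngrams = set()
    for s in states:
        for word, t in arcs.get(s, []):
            for prefix in prefixes[s]:
                extended = prefix + (word,)
                # every suffix of `extended` ending in `word` is an observed n-gram
                for k in range(1, min(len(extended), max_order) + 1):
                    ngrams.add(extended[-k:])
                # propagate only the last (max_order - 1) words as the next prefix
                prefixes[t].add(extended[-(max_order - 1):] if max_order > 1 else ())
    return ngrams
```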

For each n-gram w, an automaton (e.g., another lattice) matching paths containing the n-gram is generated, and the automaton is intersected with the lattice to find the set of paths containing the n-gram, i.e., Ψ_w. For example, if Ψ represents the weighted lattice, Ψ_w can be represented as:

Ψ_(w)=Ψ∩(Σ*wΣ*).

The posterior probability P(w|Ψ) of n-gram w can be calculated as the ratio of the total weight of paths in Ψ_w to the total weight of paths in the original lattice Ψ, as given above in Equation 3.

The posterior probability for each n-gram w can be calculated as described above and then multiplied by θ_w (an n-gram factor) as described with respect to Equation 2. An automaton is generated that accepts an input with a weight equal to the number of times the n-gram occurs in the input, times θ_w. The automaton can be represented using the weighted regular expression:

w̄ (w/(θ_w P(w|Ψ)) w̄)*,

where w̄ is the language that includes all strings that do not contain the n-gram w (i.e., the complement of Σ* w Σ*).

Each generated automaton is successively intersected with a second automaton, which begins as an un-weighted copy of the lattice. This second automaton is generated by intersecting the un-weighted lattice with an automaton accepting (Σ/θ₀)*. The resulting automaton represents the total expected gain of each path. A path in the resulting automaton that represents a word sequence E′ has a cost:

${\theta_{0}{E^{\prime}}} + {\sum\limits_{w \in N}{{\theta_{w} \cdot \#_{w}}{( E^{\prime} ) \cdot {{P( w \middle| \Psi )}.}}}}$

The path associated with the least cost, e.g., according to Equation 2, is extracted from the resulting automaton, producing the lattice MBR candidate translation.

In implementations where the hypothesis and evidence space lattices are different, the evidence space lattice is used for extracting the n-grams and computing the associated posterior probabilities. The MBR automaton is constructed starting with an un-weighted copy of the hypothesis space lattice. Each of the n-gram automata is successively intersected with the un-weighted copy of the hypothesis space lattice.

An approximation to the BLEU score is used to calculate a decomposition of the overall gain function G(E, E′) as a sum of local gain functions. A BLEU score is an indicator of the translation quality of text that has been machine translated. Additional details of BLEU are described in K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001. Bleu: a Method for Automatic Evaluation of Machine Translation. Technical Report RC22176 (W0109-022), IBM Research Division. In particular, the system calculates a first-order Taylor-series approximation to the change in corpus BLEU score from including a sentence versus not including the sentence in the corpus.

Given a reference length r of a corpus (e.g., the length of a reference sentence, or the sum of the lengths of multiple reference sentences), a candidate length c₀, and a number of n-gram matches {c_n | 1 ≤ n ≤ 4}, the corpus BLEU score B(r, c₀, c_n) can be approximated as:

$\log B = \min\Big(0,\, 1 - \frac{r}{c_{0}}\Big) + \frac{1}{4}\sum_{n = 1}^{4}\log\frac{c_{n}}{c_{0} - \Delta_{n}} \approx \min\Big(0,\, 1 - \frac{r}{c_{0}}\Big) + \frac{1}{4}\sum_{n = 1}^{4}\log\frac{c_{n}}{c_{0}},$

where Δ_n = n − 1, the difference between the number of words in the candidate and the number of n-grams, is assumed to be negligible.
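For concreteness, a short sketch of this sentence-level approximation of the corpus log(BLEU) score, assuming the candidate length c₀ and the match counts c₁..c₄ are supplied and that the brevity-penalty term uses the reference length r as in standard BLEU:

```python
import math

def approx_log_bleu(r, c0, c):
    """Approximate corpus log(BLEU) from reference length r, candidate length c0,
    and n-gram match counts c = [c1, c2, c3, c4] (the Delta_n terms are dropped)."""
    brevity = min(0.0, 1.0 - r / c0)                      # brevity penalty term
    precision = sum(math.log(c_n / c0) for c_n in c) / 4.0  # averaged log precisions
    return brevity + precision
```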

The corpus log(BLEU) gain is defined as the change in log(BLEU) when a new sentence's (E′) statistics are included in the corpus statistics, and is expressed as:

G=log B′−log B,

where the counts in B′ are those of B added to the counts for the current sentence. In some implementations, an assumption that c₀ ≥ r is used, and only c_n is treated as a variable. Therefore, the corpus log BLEU gain can be approximated by a first-order vector Taylor series expansion about the initial values of c_n as:

${G =  {\sum\limits_{n = 0}^{N}{( {c_{n}^{\prime} - c_{n}} )\frac{{\partial\log}\; B^{\prime}}{\partial c_{n}}}} |_{c_{n}^{\prime} = c_{n}}},$

where the partial derivatives are expressed as:

${\frac{{\partial\log}\; B}{\partial c_{0}} = \frac{- 1}{c_{0}}},{{{and}\mspace{14mu} \frac{{\partial\log}\; B}{\partial c_{n}}} = {\frac{1}{4c_{n}}.}}$

Therefore, the corpus log(BLEU) gain can be rewritten as:

${G = {{\Delta \; \log \; B} \approx {{- \frac{\Delta \; c_{0}}{c_{0}}} + {\frac{1}{4}{\sum\limits_{n = 1}^{4}\frac{\Delta \; c_{n}}{c_{n}}}}}}},$

where the Δ terms count the various statistics of the sentence of interest, rather than of the corpus as a whole. These approximations suggest that the values of θ₀ and θ_w (e.g., in Equation 2) can be expressed as:

${\theta_{0} = \frac{- 1}{c_{0}}},{{{and}\mspace{14mu} \theta_{w}} = {\frac{1}{4c_{w}}.}}$

Assuming that the precision of each n-gram is a constant ratio r times the precision of the corresponding (n−1)-gram, the BLEU score can be accumulated at the sentence level. For example, if the average sentence length in a corpus is assumed to be 25 words, then:

$\frac{\# \mspace{14mu} (n)\mspace{14mu} {gram\_ tokens}}{\# \mspace{14mu} ( {n - 1} )\mspace{14mu} {gram\_ tokens}} = {{1 - \frac{1}{25}} = {0.96.}}$

If the unigram precision is p, the n-gram factors (n ∈ {1, 2, 3, 4}), as a function of the parameters p and r and the number of unigram tokens T, can be expressed as:

$\theta_{0} = \frac{-1}{T}, \quad \text{and} \quad \theta_{w} = \frac{1}{4Tp\,(0.96 \cdot r)^{n}}.$
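A sketch of computing these n-gram factors, treating p, r, and T as given constants (e.g., corpus averages) and following the formula exactly as displayed above; the function name is illustrative:

```python
def ngram_factors(p, r, T, max_order=4):
    """Compute theta_0 and theta_n (n = 1..max_order) from unigram precision p,
    precision ratio r, and unigram token count T, per the formula above."""
    theta_0 = -1.0 / T
    theta_n = {n: 1.0 / (4.0 * T * p * (0.96 * r) ** n)
               for n in range(1, max_order + 1)}
    return theta_0, theta_n

# Example with the values used for FIGS. 2A and 2B: T=10, p=0.85, r=0.75.
theta_0, theta_n = ngram_factors(p=0.85, r=0.75, T=10)
```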

In some implementations, p and r are set to the average values of unigram precision and precision ratio across multiple training sets. Substituting the n-gram factors into Equation 2 shows that the MBR decoder, e.g., the MBR decision rule, does not depend on T, and multiple values of T can be used.

FIG. 2B illustrates an example MBR automaton for the translation lattice of FIG. 2A. The bold path in the translation lattice 200 in FIG. 2A is the Maximum A Posteriori (MAP) hypothesis, and the bold path in the MBR automaton 250 in FIG. 2B is the MBR hypothesis. In the example illustrated by FIGS. 2A and 2B, T=10, p=0.85, and r=0.75. Note that the MBR hypothesis (bcde) has a higher decoder cost relative to the MAP hypothesis (abde). However, bcde receives a higher expected gain than abde since it shares more n-grams with the third-ranked hypothesis (bcda), illustrating how a lattice can help select MBR translations that differ from the MAP translation.

Minimum Error Rate Training (MERT) Overview

Minimum error rate training (MERT) measures an error metric of a decision rule for classification, e.g., an MBR decision rule using a zero-one loss function. In particular, MERT estimates model parameters such that the decision under the zero-one loss function maximizes an end-to-end performance measure on a training corpus. In combination with log-linear models, the training procedure optimizes an unsmoothed error count. As previously stated, the translation that maximizes the a-posteriori probability can be selected based on:

$\underset{e}{argmax}{{P( e \middle| f )}.}$

Since the true posterior distribution is unknown, P(e|f) is approximated with a log-linear translation model, for example, which combines one or more feature functions h_m(e, f) with feature function weights λ_m, where m = 1, . . . , M. The log-linear translation model can be expressed as:

${P( e \middle| f )} = {{P_{\lambda_{1}^{M}}( e \middle| f )} = {\frac{\exp \lbrack {\sum\limits_{m = 1}^{M}{\lambda_{m}{h_{m}( {e,f} )}}} \rbrack}{\sum\limits_{e^{\prime}}{\exp \lbrack {\sum\limits_{m = 1}^{M}{\lambda_{m}{h_{m}( {e^{\prime},f} )}}} \rbrack}}.}}$

The feature function weights are the parameters of the model, and the MERT criterion finds a parameter set λ₁^M that minimizes the error count on a representative set of training sentences using the decision rule, e.g., P(e|f). Given the source sentences f₁^S of a training corpus, reference translations r₁^S, and a set of K candidate translations C_s = {e_{s,1}, . . . , e_{s,K}}, the corpus-based error count for translations e₁^S is additively decomposable into the error counts of the individual sentences, i.e.,

${E( {r_{1}^{S},e_{1}^{S}} )} = {\sum\limits_{s = 1}^{S}{{E( {r_{1},e_{1}} )}.}}$

The MERT criterion can be expressed as:

$\hat{\lambda}_{1}^{M} = \underset{\lambda_{1}^{M}}{\arg\min}\Big\{ \sum_{s = 1}^{S} E\big(r_{s}, \hat{e}(f_{s}; \lambda_{1}^{M})\big) \Big\} = \underset{\lambda_{1}^{M}}{\arg\min}\Big\{ \sum_{s = 1}^{S}\sum_{k = 1}^{K} E(r_{s}, e_{s,k})\, \delta\big(\hat{e}(f_{s}; \lambda_{1}^{M}), e_{s,k}\big) \Big\}, \quad \text{where} \quad \hat{e}(f_{s}; \lambda_{1}^{M}) = \underset{e}{\arg\max}\Big\{ \sum_{m = 1}^{M} \lambda_{m} h_{m}(e, f_{s}) \Big\}. \quad (\text{Eq. } 4)$

A line optimization technique can be used to train a linear model under the MERT criterion. The line optimization determines, for each feature function h_m and sentence f_s, the exact error surface on a set of candidate translations C_s. The feature function weights are then adjusted by traversing the combined error surfaces of the sentences in the training corpus and setting the weights to a point where the resulting error count is a minimum.

The most probable sentence hypothesis in C_s along a line λ₁^M + γ·d₁^M can be defined as:

${\hat{e}( {f_{s};\gamma} )} = {\underset{e \in C_{s}}{argmax}{\{ {( {\lambda_{1}^{M} + {\gamma \cdot d_{1}^{M}}} )^{T} \cdot {h_{1}^{M}( {e,f_{s}} )}} \}.}}$

The total score for any candidate translation corresponds to a line in the plane with γ as the independent variable. Overall, C_s defines K lines, where each line may be divided into at most K line segments due to possible intersections with the other K−1 lines.

For each γ, the decoder (e.g., the second decoder 110) determines the respective candidate translation that yields the highest score and therefore corresponds to the topmost line segment. The sequence of topmost line segments constitutes an upper envelope, which is a point-wise maximum over all lines defined by C_s. The upper envelope is a convex hull and can be inscribed with a convex polygon whose edges are the segments of a piecewise linear function in γ. In some implementations, the upper envelope is calculated using a sweep line technique. Details of the sweep line technique are described, for example, in W. Macherey, F. Och, I. Thayer, and J. Uszkoreit, Lattice-based Minimum Error Rate Training for Statistical Machine Translation, Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 725-734, Honolulu, October 2008.
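A minimal sketch of computing such an upper envelope for lines y = m·γ + b via a sweep over slopes (one common way to realize the computation; not necessarily the exact procedure of the cited paper):

```python
def upper_envelope(lines):
    """Upper envelope of lines y = m*gamma + b, given as (m, b) pairs.

    Returns a list of (gamma_start, m, b): the line that is maximal from
    gamma_start up to the start of the next segment.
    """
    # Sort by slope; among lines with equal slope keep only the largest intercept.
    lines = sorted(lines)
    filtered = []
    for m, b in lines:
        if filtered and filtered[-1][0] == m:
            if b > filtered[-1][1]:
                filtered[-1] = (m, b)
        else:
            filtered.append((m, b))

    hull = []  # stack of envelope lines, ordered by increasing slope
    for m, b in filtered:
        while hull:
            m2, b2 = hull[-1]
            x = (b - b2) / (m2 - m)          # where the new line overtakes the top line
            if len(hull) >= 2:
                m1, b1 = hull[-2]
                x_prev = (b2 - b1) / (m1 - m2)
                if x <= x_prev:              # top line is never maximal: drop it
                    hull.pop()
                    continue
            break
        hull.append((m, b))

    segments = []
    for i, (m, b) in enumerate(hull):
        if i == 0:
            start = float("-inf")
        else:
            m_prev, b_prev = hull[i - 1]
            start = (b_prev - b) / (m - m_prev)   # intersection with previous segment
        segments.append((start, m, b))
    return segments
```

Each candidate translation contributes one line whose slope and intercept come from its feature-function scores along the search direction; the segment boundaries of the returned envelope are exactly the γ values at which the decoder's best hypothesis, and hence the error count, can change.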

MERT on Lattices

A lattice (e.g., a phrase lattice) for a source sentence f can be defined as a connected, directed acyclic graph G_f = (V_f, E_f) with vertex set V_f, unique source and sink nodes s, t ∈ V_f, and a set of arcs E_f ⊂ V_f × V_f. Each arc is labeled with a phrase φ_{ij} = e_{i_1}, . . . , e_{i_j} and the (local) feature function values h₁^M(φ_{ij}, f) of this phrase. A path π = (v₀, ε₀, v₁, ε₁, . . . , ε_{n−1}, v_n) in G_f (with ε_i ∈ E_f and v_i, v_{i+1} ∈ V_f as the tail and head of ε_i, 0 ≤ i < n) defines a partial translation e_π of f, which is the concatenation of all phrases along this path. The related feature function values are obtained by summing over the arc-specific feature function values:

${{\pi {\text{:}\mspace{14mu} \underset{v_{0}}{\bullet}}}\overset{\phi_{0,1}}{\underset{h_{1}^{M}({\phi_{0,1},f})}{arrow}}{\underset{v_{1}}{\bullet}{\overset{\phi_{1,2}}{\underset{h_{1}^{M}({\phi_{1,2},f})}{arrow}}{\ldots \overset{\phi_{{n - 1},n}}{\underset{h_{1}^{M}({\phi_{{n - 1},n}.f})}{arrow}}{\underset{v_{n}}{\bullet}e_{\pi}}}}}} = {{\underset{i,{{{j\text{:}\mspace{14mu} v_{i}}arrow v_{j}} \in \; \pi}}{◯}\phi_{ij}} = {\phi_{0,1}{\bullet\ldots\bullet\phi}_{{n - 1},n}}}$${h_{1}^{M}( {e_{\pi},f} )} = {\sum\limits_{i,{{{j\text{:}\mspace{14mu} v_{i}}arrow v_{j}} \in \pi}}{h_{1}^{M}( {\phi_{ij},f} )}}$

In the following discussion, the notation enter(v) and leave(v) refers to the sets of incoming and outgoing arcs, respectively, for a node v ∈ V_f. Similarly, head(ε) and tail(ε) denote the head and tail of an arc ε, respectively.

FIG. 3 illustrates a portion of an example translation lattice 300. In FIG. 3, incoming arcs 302, 304, and 306 enter node v 310. In addition, outgoing arcs 312 and 314 exit node v 310.

Each path that starts at the source node s and ends in v (e.g., node v 310) defines a partial translation hypothesis that can be represented as a line (cf. Equation 4). Assume that the upper envelope for these partial translation hypotheses is known, and that the lines that define the envelope are denoted by f₁, . . . , f_N. Outgoing arcs ε that are elements of the set leave(v), e.g., arc 312, represent continuations of these partial candidate translations. Each outgoing arc defines another line, denoted by g(ε). Adding the parameters of g(ε) to all lines in the set f₁, . . . , f_N produces an upper envelope defined by f₁ + g(ε), . . . , f_N + g(ε).

Because the addition of g(ε) does not change the number of line segments or their relative order in the envelope, the structure of the convex hull is preserved. Therefore, the resulting upper envelope can be propagated over an outgoing arc ε to a successor node v′ = head(ε). Other incoming arcs of v′ may be associated with different upper envelopes. These upper envelopes are merged into a single, combined envelope, which is the convex hull of the union over the line sets that constitute the individual envelopes. By combining the upper envelopes over all incoming arcs of v′, the upper envelope for all partial candidate translations that are associated with paths starting at the source node s and ending in v′ is generated.

Other implementations are possible. In particular, additional refinements can be performed to improve the performance of MERT (for lattices). For example, in order to prevent the line optimization technique from stopping in a poor local optimum, MERT can explore additional starting points that are randomly chosen by sampling the parameter space. As another example, the range of some or all feature function weights can be limited by defining weight restrictions. In particular, a weight restriction for a feature function h_m can be specified as an interval R_m = [l_m, r_m], with l_m, r_m ∈ ℝ ∪ {−∞, +∞}, which defines an admissible region from which the feature function weight λ_m can be selected. If the line optimization is performed under weight restrictions, γ is selected such that: l₁^M ≤ λ₁^M + γ·d₁^M ≤ r₁^M.
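A small sketch of selecting γ under such restrictions: each component constraint l_m ≤ λ_m + γ·d_m ≤ r_m is solved for γ and the admissible range is the intersection over all components. The per-component interval arithmetic is an assumption about how the restriction could be enforced, not a procedure stated in the text:

```python
def admissible_gamma_range(lam, d, lower, upper):
    """Intersect the gamma intervals implied by l_m <= lam_m + gamma*d_m <= r_m.

    lam, d, lower, upper: equal-length lists of floats (bounds may be +/-inf).
    Returns (gamma_lo, gamma_hi); the range is empty if gamma_lo > gamma_hi.
    """
    gamma_lo, gamma_hi = float("-inf"), float("inf")
    for lam_m, d_m, l_m, r_m in zip(lam, d, lower, upper):
        if d_m == 0.0:
            continue  # lam_m itself must already satisfy the restriction
        a = (l_m - lam_m) / d_m
        b = (r_m - lam_m) / d_m
        lo, hi = min(a, b), max(a, b)   # the sign of d_m decides which bound is which
        gamma_lo, gamma_hi = max(gamma_lo, lo), min(gamma_hi, hi)
    return gamma_lo, gamma_hi
```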

In some implementations, weights priors can be used. Weights priors provide a small (positive or negative) boost ω on the objective function if the new weight is chosen so as to match a certain target value λ_m*:

$\gamma_{opt} = {\arg \; {\min\limits_{\gamma}\{ {{\sum\limits_{s}{E( {r_{s},{\hat{e}( {f_{s};\gamma} )}} )}} + {\sum\limits_{m}{{\delta ( {{\lambda_{m} + {\gamma \cdot d_{m}}},\lambda_{m}^{*}} )} \cdot \omega}}} \}}}$

A zero weights prior λ_m* = 0 allows feature selection, because the weights of feature functions that are not discriminative are set to zero. As another example, an initial weights prior λ_m* = λ_m can be used to limit changes in the parameters, such that the updated parameter set differs less from the initial weights set.

In some implementations, an interval [γ_i^{f_s}, γ_{i+1}^{f_s}) of a translation hypothesis whose change in error count ΔE_i^{f_s} is equal to zero is merged with the interval [γ_{i−1}^{f_s}, γ_i^{f_s}) of its left-adjacent translation hypothesis. The resulting interval [γ_{i−1}^{f_s}, γ_{i+1}^{f_s}) has a larger range, and the reliability of the selection of the optimum value of λ can be increased.

In some implementations, the system uses random directions to update multiple feature functions simultaneously. If the directions used in line optimization are the coordinate axes of the M-dimensional parameter space, each iteration results in an update of a single feature function. While this update technique provides a ranking of the feature functions according to their discriminative power, e.g., each iteration selects the feature function for which changing the corresponding weight yields the highest gain, the update technique does not account for possible correlations between the feature functions. As a result, the optimization may stop in a poor local optimum. The use of random directions allows multiple feature functions to be updated simultaneously. The use of random directions can be implemented by selecting lines which connect one or more random points on the surface of an M-dimensional hypersphere with the hypersphere's center (defined by the initial parameter set).
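A sketch of drawing such a random direction: sampling a standard Gaussian vector and normalizing it yields a point uniformly distributed on the M-dimensional unit hypersphere around the initial parameter set. This is a common construction assumed for illustration; the text does not specify the sampling scheme:

```python
import math
import random

def random_direction(dim):
    """Return a random unit vector in `dim` dimensions (uniform on the sphere)."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# The line through the current parameters lambda along a random direction d is
# lambda(gamma) = lambda + gamma * d, optimized by the line search described above.
```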

FIG. 4 shows an example process 400 for MBR decoding. For convenience, MBR decoding will be described with respect to a system that performs the decoding. The system accesses 410 a hypothesis space. The hypothesis space represents a plurality of candidate translations, e.g., in a target language, of corresponding input text in a source language. For example, a decoder (e.g., second decoder 110 in FIG. 1) can access a translation lattice (e.g., translation lattice 108). The system performs 420 decoding on the hypothesis space to obtain a translation hypothesis that minimizes an expected error in classification calculated relative to an evidence space. For example, the decoder can perform the decoding. The system provides 430 the obtained translation hypothesis for use by a user as a suggested translation in a target language. For example, the decoder can provide a translation text (e.g., English sample 112) for use by a user.

FIG. 5A shows an example process 500 for MERT on a lattice. For convenience, performing MERT will be described with respect to a system that performs the training. The system determines 510, for a plurality of feature functions in a translation lattice, a corresponding plurality of error surfaces for each of one or more candidate translations represented in the translation lattice. For example, error surface generation module 560 of Minimum Error Rate Trainer 550 in FIG. 5B can determine the corresponding plurality of error surfaces. The system adjusts 520 weights for the feature functions by traversing a combination of the plurality of error surfaces for phrases in a training set. For example, update module 570 of Minimum Error Rate Trainer 550 can adjust the weights. The system selects 530 weighting values that minimize error counts for the traversed combination. For example, error minimization module 580 of Minimum Error Rate Trainer 550 can select the weighting values. The system applies 540 the selected weighting values to convert a sample of text from a first language to a second language. For example, Minimum Error Rate Trainer 550 can apply the selected weighting values to a decoder.

FIG. 6 shows an example of a generic computer device 600 and a generic mobile computer device 650, which may be used with the techniques (e.g., processes 400 and 500) described. Computing device 600 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Computing device 650 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the systems and techniques described and/or claimed in this document.

Computing device 600 includes a processor 602, memory 604, a storage device 606, a high-speed interface 608 connecting to memory 604 and high-speed expansion ports 610, and a low speed interface 612 connecting to low speed bus 614 and storage device 606. Each of the components 602, 604, 606, 608, 610, and 612 are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 602 can process instructions for execution within the computing device 600, including instructions stored in the memory 604 or on the storage device 606, to display graphical information for a GUI on an external input/output device, such as display 616 coupled to high speed interface 608. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 600 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

The memory 604 stores information within the computing device 600. In one implementation, the memory 604 is a volatile memory unit or units. In another implementation, the memory 604 is a non-volatile memory unit or units. The memory 604 may also be another form of computer-readable medium, such as a magnetic or optical disk.

The storage device 606 is capable of providing mass storage for the computing device 600. In one implementation, the storage device 606 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product can be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 604, the storage device 606, or memory on processor 602.

The high speed controller 608 manages bandwidth-intensive operations for the computing device 600, while the low speed controller 612 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only. In one implementation, the high-speed controller 608 is coupled to memory 604, display 616 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 610, which may accept various expansion cards (not shown). In the implementation, low-speed controller 612 is coupled to storage device 606 and low-speed expansion port 614. The low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.

The computing device 600 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 620, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 624. In addition, it may be implemented in a personal computer such as a laptop computer 622. Alternatively, components from computing device 600 may be combined with other components in a mobile device (not shown), such as device 650. Each of such devices may contain one or more of computing devices 600, 650, and an entire system may be made up of multiple computing devices 600, 650 communicating with each other.

Computing device 650 includes a processor 652, memory 664, an input/output device such as a display 654, a communication interface 666, and a transceiver 668, among other components. The device 650 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage. Each of the components 650, 652, 664, 654, 666, and 668 are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.

The processor 652 can execute instructions within the computing device 650, including instructions stored in the memory 664. The processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor may provide, for example, for coordination of the other components of the device 650, such as control of user interfaces, applications run by device 650, and wireless communication by device 650.

Processor 652 may communicate with a user through control interface 658 and display interface 656 coupled to a display 654. The display 654 may be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 656 may comprise appropriate circuitry for driving the display 654 to present graphical and other information to a user. The control interface 658 may receive commands from a user and convert them for submission to the processor 652. In addition, an external interface 662 may be provided in communication with processor 652, so as to enable near area communication of device 650 with other devices. External interface 662 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.

The memory 664 stores information within the computing device 650. The memory 664 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. Expansion memory 674 may also be provided and connected to device 650 through expansion interface 672, which may include, for example, a SIMM (Single In Line Memory Module) card interface. Such expansion memory 674 may provide extra storage space for device 650, or may also store applications or other information for device 650. Specifically, expansion memory 674 may include instructions to carry out or supplement the processes described above, and may include secure information also. Thus, for example, expansion memory 674 may be provided as a security module for device 650, and may be programmed with instructions that permit secure use of device 650. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.

The memory may include, for example, flash memory and/or NVRAM memory, as discussed below. In one implementation, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 664, expansion memory 674, memory on processor 652, or a propagated signal that may be received, for example, over transceiver 668 or external interface 662.

Device 650 may communicate wirelessly through communication interface 666, which may include digital signal processing circuitry where necessary. Communication interface 666 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 668. In addition, short-range communication may occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 670 may provide additional navigation- and location-related wireless data to device 650, which may be used as appropriate by applications running on device 650.

Device 650 may also communicate audibly using audio codec 660, which may receive spoken information from a user and convert it to usable digital information. Audio codec 660 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 650. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.), and may also include sound generated by applications operating on device 650.

The computing device 650 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 680. It may also be implemented as part of a smartphone 682, personal digital assistant, or other similar mobile device.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any implementation or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular implementations. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter described in this specification have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

1. A computer-implemented method comprising: determining, for a plurality of feature functions in a translation lattice, a corresponding plurality of error surfaces for each of one or more candidate translations represented in the translation lattice; adjusting weights for the feature functions by traversing a combination of the plurality of error surfaces for phrases in a training set; selecting weighting values that minimize error counts for the traversed combination; and applying the selected weighting values to convert a sample of text from a first language to a second language.
2. The method of claim 1, where the translation lattice comprises a phrase lattice.
3. The method of claim 2, where arcs in the phrase lattice represent phrase hypotheses and nodes in the phrase lattice represent states at which partial translation hypotheses were recombined.
4. The method of claim 1, where the error surfaces are determined and traversed using a line optimization technique.
5. The method of claim 4, where the line optimization technique determines and traverses, for each feature function and sentence in a group, an error surface on a set of candidate translations.
6. The method of claim 5, where the line optimization technique determines and traverses the error surface starting from a random point in a parameter space.
7. The method of claim 5, where the line optimization technique determines and traverses the error surface using random directions to adjust the weights.
8. The method of claim 1, where the weights are limited by restrictions.
9. The method of claim 1, where the weights are adjusted using weights priors.
10. The method of claim 1, where the weights are adjusted over all sentences in a group of sentences.
11. The method of claim 1, further comprising selecting a target translation, from a plurality of candidate translations, that maximizes a-posteriori probability for the translation lattice.
12. The method of claim 1, where the translation lattice represents more than one billion candidate translations.
13. The method of claim 1, where the phrases comprise sentences.
14. The method of claim 1, where the phrases all comprise sentences.
15. A computer program product, encoded on a tangible program carrier, operable to cause data processing apparatus to perform operations comprising: determining, for a plurality of feature functions in a translation lattice, a corresponding plurality of error surfaces for each of one or more candidate translations represented in the translation lattice; adjusting weights for the feature functions by traversing a combination of the plurality of error surfaces for phrases in a training set; selecting weighting values that minimize error counts for the traversed combination; and applying the selected weighting values to convert a sample of text from a first language to a second language.
16. The program product of claim 15, where the translation lattice comprises a phrase lattice.
17. The program product of claim 16, where arcs in the phrase lattice represent phrase hypotheses and nodes in the phrase lattice represent states at which partial translation hypotheses were recombined.
18. The program product of claim 15, where the error surfaces are determined and traversed using a line optimization technique.
19. The program product of claim 18, where the line optimization technique determines and traverses, for each feature function and sentence in a group, an error surface on a set of candidate translations.
20. The program product of claim 19, where the line optimization technique determines and traverses the error surface starting from a random point in a parameter space.
21. The program product of claim 19, where the line optimization technique determines and traverses the error surface using random directions to adjust the weights.
22. The program product of claim 15, where the weights are limited by restrictions.
23. The program product of claim 15, where the weights are adjusted using weights priors.
24. The program product of claim 15, where the weights are adjusted over all sentences in a group of sentences.
25. The program product of claim 15, further comprising selecting a target translation, from a plurality of candidate translations, that maximizes a-posteriori probability for the translation lattice.
26. The program product of claim 15, where the translation lattice represents more than one billion candidate translations.
27. The program product of claim 15, where the phrases comprise sentences.
28. The program product of claim 15, where the phrases all comprise sentences.
29. A system, comprising: a machine-readable storage device including a program product; and one or more computers operable to execute the program product and perform operations comprising: determining, for a plurality of feature functions in a translation lattice, a corresponding plurality of error surfaces for each of one or more candidate translations represented in the translation lattice; adjusting weights for the feature functions by traversing a combination of the plurality of error surfaces for phrases in a training set; selecting weighting values that minimize error counts for the traversed combination; and applying the selected weighting values to convert a sample of text from a first language to a second language.
30. A computer-implemented system, comprising: a language model that includes: a collection of feature functions in a translation lattice; a plurality of error surfaces for a set of candidate language translations, across the feature functions; and weighting values for feature functions selected to minimize error for traversal of the error surfaces.